A count-based model for delineating cell–cell interactions in spatial transcriptomics data

Abstract Motivation Cell–cell interactions (CCIs) consist of cells exchanging signals with themselves and neighboring cells by expressing ligand and receptor molecules and play a key role in cellular development, tissue homeostasis, and other critical biological functions. Since direct measurement of CCIs is challenging, multiple methods have been developed to infer CCIs by quantifying correlations between the gene expression of the ligands and receptors that mediate CCIs, originally from bulk RNA-sequencing data and more recently from single-cell or spatially resolved transcriptomics (SRT) data. SRT has a particular advantage over single-cell approaches, since ligand–receptor correlations can be computed between cells or spots that are physically close in the tissue. However, the transcript counts of individual ligands and receptors in SRT data are generally low, complicating the inference of CCIs from expression correlations. Results We introduce Copulacci, a count-based model for inferring CCIs from SRT data. Copulacci uses a Gaussian copula to model dependencies between the expression of ligands and receptors from nearby spatial locations even when the transcript counts are low. On simulated data, Copulacci outperforms existing CCI inference methods based on the standard Spearman and Pearson correlation coefficients. Using several real SRT datasets, we show that Copulacci discovers biologically meaningful ligand–receptor interactions that are lowly expressed and undiscoverable by existing CCI inference methods. Availability and implementation Copulacci is implemented in Python and available at https://github.com/raphael-group/copulacci.


Introduction
Cell-cell interactions (CCIs) are fundamental to many biological processes in multicellular organisms including cellular differentiation (Amprino 1953, Kirouac et al. 2010), homeostasis (Wu et al. 2018), immune response (Baccin et al. 2020), wound healing (Jin et al. 2021), and response to disease condition (Kveler et al. 2018).For example, interactions between tumor cells and nearby endothelial cells lead to the formation of blood vessels in the tumor microenvironment (Rivera and Bergers 2015), enabling tumor growth and metastasis (Nishida-Aoki and Gujral 2019, Lu et al. 2012).Cells communicate with each other by exchanging biochemical signals (Skinner 1991) where a ligand emitted by one cell binds to a receptor on the surface of another cell.Thus, identifying the ligand and receptor molecules that mediate CCIs inside a tissue is an important problem with many biological applications (Miller and Lappin 2020).
Direct measurement of CCIs-e.g. through biophysical imaging, immunofluorescence staining (Leineweber 2023), and other technologies-is technologically challenging and does not readily scale to thousands of ligand and receptor molecules inside a single tissue.Thus, an alternative approach is to infer CCIs from the co-expression of ligand and receptor in bulk RNA-seq (Choi et al. 2015) or single-cell RNA-sequencing (scRNA-seq) (Hou et al. 2020, Cabello-Aguilar et al. 2020, Hu et al. 2021, Wang et al. 2019).These methods typically rely on a set of ligands and receptors from the manually curated database (T€ urei et al. 2021, Efremova et al. 2020, Jin et al. 2021) of validated or predicted CCIs.
An important limitation of using scRNA-seq data to infer CCIs is that scRNA-seq technologies dissociate cells before measurement, thus losing the spatial positions of individual cells, which are a key constraint on communicating cells.As a result, scRNA-seq-based methods for CCI inference tend to infer many false positive interactions, as they are unable to distinguish spatially adjacent cells that are interacting versus cells that are far from one another and have correlated ligand and receptor expressions (Bafna et al. 2023).Recent developments in spatially resolved transcriptomics (SRT) technologies (Williams et al. 2022, Rao et al. 2021) have led to the collection of high-throughput transcriptomics measurements together with the spatial locations of individual cells, enabling potentially more accurate inference of CCIs.
Several computational methods have been developed to infer CCIs from SRT data, generally following one of two different approaches.The first approach is to identify ligandreceptor pairs with a large value of a spatially aware coexpression score (Miller et al. 2021, Li et al. 2023, Pham et al. 2023, Dries et al. 2021) which is typically based on the Pearson correlation coefficient or the Moran's I autocorrelation measure used in geospatial statistics (Getis and Ord 1992).However, such co-expression scores are challenged by the sparsity of current SRT technologies, and thus these methods struggle to identify interacting ligands and receptors with low unique molecular identifier (UMI) counts.Moreover, these methods identify only putative interacting ligands and receptors and do not identify the specific regions of a tissue where the interactions take place.The second approach, which has been followed by a few recent methods, is to explicitly model the biological cell signaling process inside a tissue, e.g. using Gaussian processes (Arnol et al. 2019) or optimal transport (Cang andNie 2020, Cang et al. 2023).However, these methods must make broad assumptions about the biophysical process of inter-cellular communication, and as a result, are computationally inefficient.Moreover, none of the existing CCI inference methods account for the fact that SRT measures discrete UMI counts, as all of these approaches transform the counts to continuous values.However, many papers in the single-cell analysis literature have shown that such transformations introduce unwanted zero inflation and other biases in downstream analyses (Townes et al. 2019, Booeshaghi andPachter 2021).
We introduce Copulacci, a copula-based method for inferring CCIs from sparse SRT data.We model the bivariate distribution of expression counts of each ligand and receptor using copulas, which encode interactions between random variables while allowing for arbitrary marginals.In particular, we use a bivariate Gaussian copula distribution, a widely used and interpretable distribution, and we use Poisson marginals to account for the discrete and sparse UMI counts in SRT data.We derive a computationally efficient optimization algorithm to maximize the likelihood of our model.The estimated parameters of our model reveal CCIs, i.e. which ligands and receptors are interacting.Moreover, Copulacci computes an interaction score for a ligand-receptor pair between pairs of nearby spatial locations, revealing the spatial regions where CCIs occur.
On simulated data with varying amounts of data sparsity, Copulacci yields demonstrably more accurate identification of CCIs compared to existing CCI inference methods, including the Pearson and Spearman correlation coefficients, with Copulacci having better performance on sparser data.The improved performance of Copulacci on sparse data is because Copulacci models the distribution of ligand and receptor co-expression, in contrast to conventional CCI inference methods which only learn a point estimate for the dependency between ligand and receptor co-expression.On SRT data from a human breast cancer (10x Visium), mouse cortex (seqFISHþ), and mouse embryo (Stereo-seq), we show that Copulacci discovers many biologically relevant CCIs that are not identified by existing approaches due to low expression of ligands and/or receptors.

Cell-cell interaction inference problem
We start by formalizing the problem of identifying CCIs from SRT data.Suppose we are given UMI counts ≥0 for a ligand and receptor, respectively, across N spatial locations (spots) in a tissue slice, as well as the spatial location s i 2 R 2 of each spot i ¼ 1; . . .; N. We emphasize that each spot i has two measurements: a ligand expression ' i and a receptor expression r i .
We describe the potential CCIs with a directed graph G ¼ ðV; EÞ.The vertices V ¼ f1; . . .; Ng represent the spots and each directed edge ði; jÞ 2 E represents a potential CCI where a ligand expressed in spot i may bind with a receptor expressed in spot j.Typically the directed graph G is symmetric, meaning that ði; jÞ 2 E implies ðj; iÞ 2 E; that is, if a ligand from spot i is able to bind with a receptor in spot j, then a ligand in spot j is also able to bind with a receptor in spot i.
For this reason, one may equivalently describe the potential CCIs with an undirected graph, where an undirected edge ði; jÞ in an undirected graph corresponds to two directed edges ði; jÞ; ðj; iÞ 2 E in a directed graph G.We will use the undirected graph representation when it is clear from the context.
The inclusion of an edge ði; jÞ 2 E typically depends on the interaction range of the ligand and receptor and the physical distance between spots i and j.For example, most signaling occurs between cells that are within 100-200 μm of one another, and thus the edges E connect spots that are at most 200 μm apart.In SRT data measured using the 10x Genomics Visium technology, each spot lies on a hexagonal grid and adjacent spots are 100 μm apart, and so one natural choice for the graph G is a connected subgraph of the hexagonal lattice.The graph may also be weighted, where each edge ði; jÞ 2 E has a weight w i;j reflecting the physical distance between spots i and j.
The standard approach to identify CCIs mediated by a ligand and receptor is to test whether the ligand expression ' i and receptor expression r j across spatially adjacent spots ði; jÞ 2 E are dependent on each other or not.More formally, given samples fð' i ; r j Þg ði;jÞ2E of a pair of random variables ðL; RÞ representing a ligand L and a receptor R, we aim to estimate the "dependency" between L and R, where we use quotes to indicate different possible choices for measuring the dependency.
We formalize this task as the following computational problem.

CCI inference problem
Given a spatial adjacency graph G ¼ ðV; EÞ, and samples fð' i ; r j Þg ði;jÞ2E of a bivariate random variable ðL; RÞ, estimate a dependency score between L and R.
Nearly all existing CCI inference methods solve the CCI inference problem (CCIIP) with the dependency score corresponding to the following correlation coefficient: q ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P j ðr j − � rÞ 2 q ; (1) for some choice of edge weights w i;j and normalizing constant c, where � '; � r are the mean expression of ligand and receptor, respectively.
We call ρ spatial a generalized Pearson correlation measure, as ρ spatial is equal to the Pearson correlation coefficient when c ¼ 1 and G is an unweighted graph, i.e. w i;j ¼ 1 for all edges ði; jÞ 2 E. The generalized Pearson correlation measure ρ spatial is sometimes called a bivariate Moran's I measure in the single-cell literature, as it bears close similarity to the Moran's I statistic used to measure autocorrelation in geospatial statistics.For example, MERINGUE (Miller et al. 2021) solves the CCIIP with a generalized Pearson correlation measure ρ SCC , which they call a spatial cross-correlation measure, where G is an unweighted graph with an edge between two spots if they are adjacent in a Delauney triangulation and c ¼ N jEj .SpatialDM (Li et al. 2023) solves the CCIIP using the generalized Pearson correlation measure ρ SpatialDM where c ¼ i482 Sarkar et al.
1 and the spatial adjacency graph G is a complete graph with edge weights , where d i;j ¼ jjs i −s j jj 2 is the physical distance between spot i and spot j.
A major limitation of the generalized Pearson correlation measure ρ spatial is that it does not account for the low coverage and sparsity of current SRT technologies.For example, the standard Pearson correlation coefficient is known to have reduced power for non-normally distributed data (Bishara and Hittner 2012), and the distribution of sparse UMI counts is not close to being normally distributed.More generally, the underlying issue is that nearly all existing approaches for identifying CCIs derive a dependency score for each ligandreceptor pair that does not model the joint distribution of ligand and receptor expression values.
A natural approach for identifying CCIs would be to derive a bivariate count-based distribution to model the coexpression of both a ligand and receptor gene.To this end, several count-based distributions, including Poisson and negative binomial distributions, have been used to model sparse UMI counts of individual genes in single-cell RNA-sequencing data (Townes et al. 2019, O'Hara and Kotze 2010, Hafemeister and Satija 2019).However, there are numerous challenges in extending such a univariate count distribution to a bivariate distribution.In particular, unlike the multivariate normal distribution which has a well-established canonical form, the multivariate Poisson and negative binomial distributions lack a universally accepted definition (Inouye et al. 2017).For example, Campbell (1934) defined a bivariate Poisson distribution as a summation of independent Poisson variables.However, this definition only permits positive dependency between variables.Other definitions of bivariate count distributions capable of accommodating both positive and negative dependency, such as the bivariate negative binomial distributions defined by Subrahmaniam and Subrahmaniam (1973) and Mitchell and Paulson (1981), require computationally intensive inference procedures for parameter estimation (Inouye et al. 2017).
To address these challenges we introduce Copulacci (Fig. 1), a framework for using copula distributions to model the joint distribution of ligand expression and receptor expression.Copulas are a general framework for modeling multivariate distributions as they can model arbitrary count distributions including Poisson or negative binomial distributions; include both positive and negative correlations; and are efficient to fit in practice (see Inouye et al. 2017 for details).We note that copula distributions have previously been used by the scDesign framework to model and simulate UMI count matrices in single-cell transcriptomics data (Sun et al. 2021, Song et al. 2023).

A bivariate copula model for cell-cell interactions
Copulacci models the joint distribution Pð' i ; r j Þ of adjacent ligand and receptor expressions using a bivariate copula.A bivariate copula is a function that encodes dependency relationships (positive or negative) between two random variables L and R, while allowing for arbitrary marginal distributions PðLÞ; PðRÞ on the random variables.
Formally, a bivariate copula C : ½0; 1� 2 !½0; 1� is a function that is equal to the joint cumulative distribution function (CDF) of two random variables with uniform marginals.The benefit of copulas comes from Sklar's theorem, which states that any bivariate distribution on random variables ðL; RÞ with joint CDF Fð'; rÞ satisfies Fð'; rÞ ¼ CðF L ð'Þ; F R ðrÞÞ; (2) for a copula function Cð'; rÞ, where F L ð'Þ, F R ðrÞ are the marginal CDFs of the respective random variables L, R.
From Equation (2), in order to define the joint distribution Pð' i ; r j Þ of ligand and receptor UMI counts in adjacent spots ði; jÞ 2 E, which is equivalent to defining the joint CDF Fð'; rÞ, we must specify (1) a copula function Cð'; rÞ and (2) the marginal distribution CDFs F L ð' i Þ; F R ðr j Þ of the ligand and receptor expression values, respectively.
For our applications, we use a Gaussian copula is parametrized by a parameter ρ which describes the correlation between the ligand and receptor expression values.For this reason, we call ρ the copula correlation coefficient.For notational simplicity, we write C Σρ as C ρ .

Marginal distributions
We model the marginal distributions of the ligand and receptor expression values ' i ; r j , respectively, with Poisson distributions of the form ' i � PoisðN i e μ ' Þ and r j � PoisðN j e μ r Þ, where N i is the total UMI count at spot i and μ ' ; μ r are the (normalized) mean expression values for genes '; r, respectively.Poisson distributions have been previously used to model sparse UMI counts in single-cell and SRT (Townes et al. 2019, Ma et al. 2022, Chitra et al. 2023).However, we emphasize that one may use other count-based distributions too, e.g.negative binomial or multinomial distributions (Townes et al. 2019, Baker 1994).We use F L ð' i jμ ' Þ and F R ðr i jμ r Þ to refer to the marginal CDFs of the Poisson distributions for the ligand and receptor expressions, respectively.

Maximum likelihood
From Equation (2), the joint CDF over the observed ligand, receptor expressions ' i ; r j for each pair ði; jÞ 2 E of adjacent spots are given by Subsequently, the corresponding joint probability distribution Pð' i ; r j Þ is obtained by differentiating C ρ , We estimate the parameters ρ; μ ' ; μ r by maximizing the log-likelihood of the data across all adjacent spots ði; jÞ 2 E: We use the estimated parameter ρCopula , i.e. the estimated copula correlation coefficient, as our dependency score for solving the CCIIP.

Identifiability
Sklar's theorem states that any continuous multivariate distribution has a unique decomposition as a product of a copula function and its marginals.However, for discrete distributions, this decomposition may not be unique, i.e. there exist multiple different ways to write a discrete distribution as a product of a copula function and its marginals (Inouye et al. 2017).This non-identifiability may lead to inaccurate estimates of the copula correlation coefficient ρCopula (Genest and Ne� slehov� a, 2007).
One standard approach to account for this unidentifiability is with a distributional transform (R€ uschendorf and R€ uschendorf 2013), which involves transforming a discrete marginal distribution Pð' i Þ into a continuous distribution by "smoothing" out the CDF of the distribution.We specifically follow the distributional transform approach described by Kazianka and Pilz (2010); see Supplementary Appendix and Inouye et al. (2017) , Kazianka and Pilz (2010) for details.
For ligand-receptor pairs where both the ligand and the receptor have low (or zero) expression across many spots, the maximum likelihood estimator (MLE) (5) may yield multiple values of the correlation ρ; e.g. if the log-likelihood log Lðρ; μ ' ; μ r j'; r; GÞ is a non-convex function of the correlation ρ.Thus, we exclude from our analysis ligand-receptor pairs where the log-likelihood function log Lðρ; μ ' ; μ r j'; r; GÞ is not a convex function of ρ, i.e. the log-likelihood has a non-positive second derivative as a function of ρ.Specifically, we exclude ligand-receptor pairs where d 2 f dρ 2 j ρ¼ρ 0 ≤ 0 for some −1 ≤ ρ 0 ≤ 1, where f ðρÞ ¼ log Lðρ; μ' ; μr j'; r; GÞ is the loglikelihood function where the ligand and receptor mean expression values μ ' ; μ r , respectively, are set equal to their MLEs from (5) with fixed ρ ¼ 0.

Cell type-specific CCIs
When each spot in the SRT data is annotated with a cell type label, we formulate a cell type-specific CCIIP for identifying CCIs between two specific cell types.Specifically, given a cell type label γ i for each spot i ¼ 1; . . .; N, we identify CCIs between cell types c and c 0 by solving the CCIIP on the graph G c;c 0 ¼ ðV c;c 0 ; E c;c 0 Þ where the vertices V c;c 0 ¼ fi : γ i ¼ c or γ i ¼ c 0 g are the spots with cell type label c or c 0 , and the edges E c;c 0 ¼ fði; jÞ 2 E : γ i ¼ c; γ j ¼ c 0 or γ i ¼ c 0 ; γ j ¼ cg connect a cell with label c to a cell with label c 0 .We note that for technologies with low spatial resolution, such as 10x Visium, a spot is often annotated with a single cell type even though technically a spot may contain multiple cell types.

Statistical significance
After estimating the copula correlation coefficient ρCopula , we aim to test whether the estimated coefficient ρCopula is larger (in absolute value) than expected by chance when cells are randomly arranged in the tissue.We formalize this with a permutation test, where we define the null distribution of correlation values ρ as the distribution of correlation coefficient measures ρCopula;G 0 estimated using graph G 0 , where G 0 is obtained by permuting the spots in the spatial adjacency graph G; that is, G 0 ¼ ðV; E 0 Þ where ði; jÞ 2 E if and only if ðπðiÞ; πðjÞÞ 2 E 0 for a fixed permutation π on the vertices V. We uniformly sample graphs G 0 by permuting the vertices (spots) in the spatial adjacency graph G, and we estimate the p-value as the fraction of graphs G 0 where jρ Copula;G 0 j>jρj.We control the false discovery rate using the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995).We emphasize that for the cell type-specific CCIIP, which is the problem that we solve in many of our applications, our permutation test only permutes spots of a given cell type(s).

CCI visualization
We use the estimated distribution Pð' i ; r j Þ to identify spatial regions where ligand-receptor interactions are taking place.We define the interaction score between two adjacent spatial locations with ligand and receptor expression values ' i ; r j as the Mahalanobis distance z T ij Σ−1 ρCopula z ij between the vector z ij ¼ ½Φ −1 N ð0;1Þ ðF L ð' i ÞÞ; Φ −1 N ð0;1Þ ðF R ðr j ÞÞ� and the bivariate Gaussian distribution N ð0; Σρ Copula Þ.Here, z ij is a transformation of the ligand-receptor expression values ½' i ; r j � to the same scale as the estimated bivariate Gaussian distribution Nð0; Σρ Copula Þ. Intuitively, a larger Mahalanobis distance for spatial locations i, j implies that the ligand and receptor expression values ' i ; r j are substantially larger or smaller than the mean, indicating a potential interaction.

Data pre-processing and implementation
For SRT data measured with 10x Genomics Visium (10x Genomics) (resp.Stereo-seq; Chen et al. 2022), where the spots are organized on a hexagonal (resp.rectangular) grid, then the spatial adjacency graph G is a hexagonal (resp.rectangular) grid graph.For SRT data where the spots are organized in an irregular fashion, e.g.SRT measured using seqFISHþ (Eng et al. 2019), we use Squidpy (Palla et al. 2022) to construct a spatial neighborhood graph using a user-defined distance threshold.In all cases, we also include "self-loops" in the graph G ¼ ðV; EÞ, i.e. edges ði; iÞ 2 E from a spot i to itself.These self-loops account for SRT data where a spot may consist of multiple cells, e.g.data measured with 10x Genomics Visium, and additionally, they describe autocrine signaling, where a ligand of a cell binds to a receptor of the same cell.
Following Li et al. (2023) and Cang et al. (2023), Copulacci uses a set of candidate ligand-receptor pairs specified in CellChatDB (Jin et al. 2021) and further filters out the ligands and receptors that are not expressed in a fixed number of spots using the same filtering criteria as Li et al. (2023).Furthermore, the CellChatDB database often contains heteromeric ligands and/or receptors, meaning the ligand/receptor is a complex consisting of multiple genes.In such cases, we obtain a single expression value by summing the expression values of each gene in the complex.Since Copulacci directly models count data, we do not perform any numerical transformation of the data such as log-normalization.

Simulated count data
We first evaluated Copulacci using simulated non-spatial count data.We drew 500 samples ð' i ; r j Þ of ligand and receptor expression values from a bivariate Gaussian copula distribution with Poisson marginals, where the bivariate Gaussian has a covariance matrix and the Poisson marginals are ' i � PoisðkN i e μ ' Þ and r j � PoisðkN j e μ r Þ for the ligand and receptor, respectively.The parameters N i represent the total UMI count at a spot i, and we set it equal to the total UMI counts from a 10x Visium dataset (Xu et al. 2022, Bergenstråhle et al. 2020).The parameter k controls the sparsity of the simulated data, such that the ligand and receptor expressions ð' i ; r j Þ are more sparse for smaller values of k and vice versa (Fig. 2a).We choose the sparsity parameter k such that the data sparsity falls into three buckets: <10%, 10%-30%, and more than 30%.See Supplementary Appendix for more details.
We compare the copula correlation coefficient estimated by Copulacci to the Pearson and Spearman correlation coefficients, which are standard correlation measures used to quantify CCIs in single-cell and spatial analysis (Armingol et al. 2021).In particular, we computed the Spearman and Pearson correlation coefficients on log-transformed and normalized UMI counts, as is standard in the literature (Li et al. 2023).We note that we do not compare to the Moran's I measure as we do not simulate spatial locations.
Copulacci more accurately estimates the true correlation coefficient ρ compared to the Spearman and Pearson correlation coefficients for all parameter choices (Fig. 2b).In particular, Copulacci has the most improvement over the Pearson and Spearman correlations when the ligand and receptor UMI counts have sparsity of at least 30%, i.e. a similar level of sparsity as many current SRT technologies.We also emphasize that the Copulacci accuracy has noticeably smaller variance compared to the accuracy of the Spearman and Pearson correlation coefficients.This demonstrates that Copulacci is more robust than the Pearson and Spearman correlation measure, especially on sparse data.
Moreover, we observe that the Pearson and Spearman correlations tend to underestimate the true absolute correlation coefficient jρj (Fig. 2c-e).In particular, when there is a large amount of data sparsity (>30%) and the magnitude of the correlation coefficient ρ is large (i.e.jρj>0:5), then the Spearman and Pearson correlations have a much smaller magnitude than the true correlation coefficient, i.e. jρj � jρj.In contrast, Copulacci estimates a correlation coefficient ρCopula with a magnitude close to the true correlation coefficient ρ across all data sparsity levels and coefficients ρ.
These simulations demonstrate that, by using a countbased model, Copulacci more accurately measures correlations between ligands and receptors in sparse SRT data.

Visium data from human breast cancer
We next used Copulacci to infer CCIs from SRT data of a human breast cancer (invasive ductal carcinoma) measured with 10x Genomics Visium (Xu et al. 2022, Bergenstråhle et al. 2020).In this dataset, we observe the expression of 649 ligands and receptors-forming 1315 pairs of interacting ligands and receptors, as obtained from (Li et al. 2023)-in 3798 spots.The data exhibits a high level of sparsity, with the expression of a given ligand or receptor occurring in merely 25% of these locations on average.
Since there is no ground truth, we compare the correlation coefficients ρCopula estimated by Copulacci to those estimated by two other methods, MERINGUE (Miller et al. 2021) and SpatialDM (Li et al. 2023).MERINGUE and SpatialDM estimate variants of the generalized Pearson correlation, which MERINGUE calls spatial cross-correlation (see Section 2).Moreover, since each spot in this dataset is annotated with one of four cell type labels (Xu et al. 2022)-tumor, surrounding tumor, invasive region, and healthy-we analyze cell type-specific CCIs (see Supplementary Table S1), i.e.CCIs between adjacent spots with two pre-specified cell types.
We observe that Copulacci estimates markedly different correlation coefficients compared to SpatialDM and MERINGUE for CCIs between tumor and surrounding tumor cell types (Fig. 3a).In particular, Copulacci estimates substantially larger correlation coefficients than both SpatialDM and MERINGUE.We also find that the ligandreceptor pairs in which Copulacci estimates a larger correlation than SpatialDM or MERINGUE tend to have sparser expression, for example the ligand-receptor pairs IL16-CD4 and CXCL13-CXCR3 (Fig. 3b).Since SpatialDM and MERINGUE use variants of the Pearson correlation coefficients, these results align with our simulation results-where we showed that Pearson correlation underestimates the true correlation for sparse data.
The ligand-receptor pairs with large correlations estimated by Copulacci are involved in several important tumor processes.For example, Copulacci identifies the ligand-receptor pairs CXCL13-CXCR3 and CXCL11-CXCR3 as having a significantly (false discovery rate <0:2) large correlation (ρ Copula ¼ 0:41 and ρCopula ¼ 0:39, respectively) between tumor and surrounding tumor spots (Fig. 3c).The ligand proteins CXCL13 and CXCL11 and receptor protein CXCR3 are chemokines whose interactions are known to mediate the immune response in the tumor microenvironment, e.g.interactions between CXCL13 and CXCR3 recruit suppressive immune cells into tumors in some cancer types (Gao et al. 2021, Gao andZhang 2021).Another ligand-receptor pair that Copulacci identifies as having a significantly large correlation is IL16-CD4.IL16 is a chemoattractant, and increased IL16 expression has been observed to recruit CD4þ protumor macrophages in different breast cancers (Richmond et al. 2014, Hridi et al. 2021, Mathy et al. 2000).We note that existing tools, including SpatialDM and MERINGUE, do not identify these ligand-receptor pairs while Copulacci does.
Copulacci uses the Mahalanobis distance to quantify the interaction strength between a ligand and a receptor at different spatial locations (see Section 2), enabling the visualization of where CCIs take place in a tissue.For example, Copulacci determines the regions at the interface between the tumor and surrounding tumor regions where the interactions between CXCL13 and CXCR3 occur (Fig. 3d).This localization may lead to the identification of biologically interesting spatial cohorts or niches that can be dissected more closely.We emphasize that MERINGUE is unable to visualize these interactions, while SpatialDM is only able to identify large neighborhoods where interactions take place, not individual spatial locations.Overall, our analysis shows that Copulacci measures accurate and biologically meaningful CCIs in the tumor microenvironment.

SeqFISH+ data from mouse cortex
We next applied Copulacci to SRT of the mouse somatosensory cortex measured with seqFISHþ (Eng et al. 2019).In this dataset, the expression of 1157 ligand-receptor pairs are measured in 523 spots.Here, each measured spot is a single cell, in contrast to 10x Visium where each spot contains multiple cells.This SeqFISHþ dataset is less sparse than 10x Visium data: on average, a ligand or receptor is expressed in approximately 30% of the spots.
Copulacci identified several biologically meaningful CCIs between the L5 eNeurons, L6 eNeurons, and astrocytes in the mouse cortex (Supplementary Fig. S1a).In particular, Copulacci identifies several CCIs between astrocytes and L6 eNeurons mediated by ligands and receptors in the Wnt signaling pathway, which is known to regulate many neuronal processes in the mouse cortex including neuronal migration (Bocchi et al. 2017) and synaptic plasticity (Folke et al. 2019).We also highlight the interaction (Supplementary Fig. S1b) between the laminin ligand Lamc3 and the heteromeric integrin receptor Itgav þ Itgb8, which has a substantially larger copula correlation coefficient compared to the correlation coefficients computed by MERINGUE and SpatialDM (Supplementary Fig. S2).Mutations to integrin proteins have been observed to result in abnormal laminar organization of the mouse cortex, suggesting that interactions between laminins and integrins are crucial for mouse cortex function (Anton et al. 1999).This analysis demonstrates how Copulacci reveals biologically meaningful CCIs in highly sparse SRT data which are missed by current approaches.

Stereo-seq data from a mouse embryo
We used Copulacci to infer CCIs from SRT of a mouse embryo (embryonic day E9.5) measured with Stereo-seq (Chen et al. 2022).Copulacci identifies several biologically meaningful interacting ligand-receptor pairs in the mouse embryo that are missed by existing methods (Supplementary Fig. S3), including ligand-receptor pairs at the interface between the brain and neural crest.See Supplementary Appendix for details.

Discussion
The rapid development of SRT technologies opens new avenues for the systematic analysis of high-resolution molecular mechanisms such as CCIs.However, the analysis of current spatial SRT data is challenged by low UMI counts.Existing approaches use correlation measures designed for continuous-valued data and are unable to accurately measure correlations between discrete and sparse UMI counts.
Copulacci resolves this problem by directly modeling the raw UMI counts from ligands and receptors using a Gaussian A count-based model for cell-cell interactions copula model, which can model dependencies in discrete data and accounts for sparsity using Poisson marginals.We showed that Copulacci outperforms the standard correlation measures using both simulated data and real SRT data.In particular, we show that standard correlation measures based on Pearson or Moran's I tend to underestimate the true correlation.As a result, Copulacci identifies many novel CCIs that are missed by existing approaches.Moreover, Copulacci's copula model also allows for the identification of spatial regions where a CCI is taking place.
There are several future directions for Copulacci.First, a major limitation of every existing CCI inference methodsincluding Copulacci-is the lack of a ground truth in most real biological tissues.Thus, it would be useful to apply Copulacci to different samples of the same tissue type, e.g. using different technologies, and create "consensus" lists of probable CCIs that can be used to benchmark CCI inference methods.Second, Copulacci and other CCI inference methods rely on a database of candidate ligand-receptor interactions (Efremova et al. 2020, Jin et al. 2021), which potentially biases algorithms toward ligands and receptors that have been previously studied.It would be useful to extend Copulacci to identify interacting ligands and receptors de novo from sparse SRT data.Third, one could extend Copulacci to model dependencies between other counts derived from SRT data.For example, for technologies with low spatial resolution such as 10x Genomics Visium, it may be useful to model correlations between counts of cell types (e.g.derived from Kleshchevnikov et al. 2022) using a copula with binomial marginals.Finally, the count-based model in Copulacci could be extended to model differential interactions across various cell states and cell types in multiple casecontrol samples, further elucidating the CCIs responsible for shaping the spatial organization of tissues.

Figure 1 .
Figure 1.Overview of Copulacci.(a) A breast cancer SRT dataset with four cell type labels: tumor (T), surrounding tumor (S), invasive (I), and healthy (H).(b) Spatial adjacency graph G ¼ ðV ; EÞ where edges E connect physically adjacent spots, i.e. spots within a predefined distance from one another.(c) Copulacci uses a bivariate Gaussian copula with Poisson marginals to model the distribution of observed ligand and receptor expression values fð' i ; r j Þg ði;jÞ2E across each edge ði; jÞ 2 E in the spatial adjacency graph G.The copula distribution is parameterized by the correlation coefficient ρ of the covariance matrix, which we call the copula correlation coefficient, and the mean expression μ ' ; μ r of the ligand and receptor, respectively.Copulacci computes the maximum likelihood estimators (MLEs) ρ, μ' , μr of the copula correlation coefficient, mean ligand expression, and mean receptor expression, respectively.(d) (Left) Ligand and receptor expression values for spots with tumor (T) cell type label.(Right) Copulacci estimates model parameters ρ, μ' , μr and computes an interaction score (quantified by Mahalanobis distance from the estimated copula distribution) which indicates the strength and direction of cell-cell interactions between adjacent spots.

Figure 2 .
Figure 2. Evaluation of Copulacci on simulated data.(a) 3D histogram of simulated ligand-receptor expression counts ð' i ; r j Þ with correlation ρ and different levels of data sparsity.The x-axis is the ligand expression ' i ; the y-axis is the receptor expression r j ; and the height of each bar is the observed number of pairs ð' i ; r j Þ in the simulated data.(b) Difference ρ−ρ between true correlation ρ and predicted correlation ρ for Copulacci (Copula), Spearman correlation coefficient, and Pearson correlation coefficient, across different simulated instances, grouped by sparsity of simulated ligand and receptor expression values.(c-e) True correlation ρ versus estimated (c) copula correlation coefficient (ρ Copula ) learned by Copulacci, (d) Spearman correlation coefficient (ρ Spearman ), and (e) Pearson correlation coefficient (ρ Pearson ) across different simulated instances.Lines are colored according to data sparsity and are fit by a linear regression.

Figure 3 .
Figure 3. Ligand-receptor interactions inferred from Copulacci in a breast cancer SRT dataset.(a) Scatter plot of the copula correlation coefficient (from Copulacci) versus: (top) the spatial cross-correlation coefficient used by MERINGUE (Miller et al. 2021) and (bottom) the global I score used by SpatialDM (Li et al. 2023) for ligand-receptor pairs ð'; rÞ from tumor to surrounding tumor cell types.The five ligand-receptor pairs highlighted have the largest absolute difference between correlation coefficients.(b) 3D histogram plot of ligand UMI count versus receptor UMI count for ligand-receptor pair CXCL13-CXCR3.(c) Dotplot highlighting the statistically significant ligand-receptor pairs across cell type interactions where one of the participating cell type is tumor.Cell types are abbreviated as tumor (T), surrounding tumor (S), and invasive (I).Color indicates FDR corrected p-values from the Copulacci permutation test.(d) UMI counts and Copulacci directed interaction score for the ligand-receptor pair CXCL13-CXCR3 for spots at the border of the tumor and surrounding tumor regions in the boxed region of the tissue slice.